Brazilian patent BR112015007306A2: acoustic echo cancellation system, fast Fourier transform, acoustic echo cancellation method, signal

Abstract:
Acoustic echo cancellation method. A system and method are presented for acoustic echo cancellation. The echo canceller reduces acoustic and hybrid echoes that may arise in a situation such as a long-distance conference call with several speakers in different environments, for example. Echo cancellation, in at least one embodiment, can be based on similarity measurement, statistical determination of echo cancellation parameters from historical values, frequency domain operation, double-talk detection, packet loss detection, signal detection and noise subtraction.
Publication number: BR112015007306A2
Application number: R112015007306
Filing date: 2013-10-22
Publication date: 2020-04-22
Inventors: Nagaraja Lyer Ananth; Ganapathiraju Aravind; Immanuel Wyss Felix; Charles Vlack Kevin; Vergin Rivarol; Cheluvaraja Srinath
Applicant: Interactive Intelligence Inc
Main IPC class:

Patent description:

DESCRIPTIVE REPORT
ACOUSTIC ECHO CANCELLATION METHOD
BACKGROUND
[0001] The present invention generally relates to telecommunication systems and methods, as well as communication networks. More particularly, the present invention relates to echo cancellation over communication networks.
SUMMARY
[0002] A system and method are presented for acoustic echo cancellation. The echo canceller reduces acoustic and hybrid echoes that can arise in a situation such as a long-distance conference call with several speakers in different environments, for example. Echo cancellation, in at least one embodiment, can be based on similarity measurement, statistical determination of echo cancellation parameters from historical values, frequency domain operation, double-talk detection, packet loss detection, signal detection and noise subtraction.
[0003] In one embodiment, an acoustic echo cancellation system is described, comprising: means for audio input; means for generating an audio signal from said audio input; means for transmitting said audio signal; means for converting said audio signal into a frequency domain; means for performing similarity measurement; means for performing delay estimation; means for performing echo parameter estimation; means for performing statistical echo validation; means for detecting speech; and means for detecting double-talk.
[0004] In another embodiment, a method for acoustic echo cancellation is described, comprising the steps of: a) initializing echo model parameters; b) analyzing audio for speech; c) determining whether speech was detected, wherein if speech was not detected, continuing to analyze said audio for speech; d) estimating echo delay and validating said echo model if speech was detected; e) determining whether echo is present, wherein if echo is not present, continuing to analyze the audio for speech before continuing the process and repeating the process from step c); f) determining whether double-talk is present, wherein if double-talk is present, computing parameters for echo with double-talk, and if double-talk is not present, computing parameters for ordinary echo; g) performing echo subtraction; h) tracking the echo and updating said echo model; and i) determining whether the echo is still present, wherein: if echo is not present, starting the method again; and if echo is present, repeating the method starting with step f).
[0005] In another embodiment, an acoustic echo cancellation system over communication networks is described, comprising: means for audio input; means for generating an audio signal from said audio input; means for transmitting said audio signal; means for converting said audio signal from a time domain to a frequency domain; means for performing one or more of: similarity measurement and delay estimation; echo parameter estimation; statistical echo validation; means for detecting speech; and means for detecting double-talk.
[0006] In another embodiment, a method for acoustic echo cancellation is described, comprising the steps of: transforming an audio signal; initializing echo model parameters; analyzing said audio signal for speech; detecting the presence of speech; estimating echo delay and validating said echo model; detecting the presence of echo; detecting the presence of double-talk; computing parameters for at least one of: echo with double-talk and echo; subtracting the echo from the audio signal; updating said echo model; and determining whether the presence of echo is reduced.
BRIEF DESCRIPTION OF THE FIGURES
[0007] Figure 1 is a diagram illustrating an embodiment of echo.
[0008] Figure 2 is a diagram illustrating an embodiment of the operation of an echo cancellation system.
[0009] Figure 3 is a diagram illustrating an embodiment of the operation of a modified echo cancellation system.
[0010] Figure 4 is a diagram illustrating an embodiment of the similarity measure.
[0011] Figure 5 is a diagram illustrating an embodiment of the components of a similarity module.
[0012] Figure 6 is an embodiment of a histogram.
[0013] Figure 7 is a flow chart illustrating an embodiment of the echo cancellation process.
[0014] Figure 8 is an illustration of an embodiment of convergence time.
[0015] Figure 9 is an illustration of an embodiment of echo cancellation with low or no convergence time.
[0016] Figure 10 is a diagram illustrating an embodiment of echo over a VoIP network.
DETAILED DESCRIPTION
[0017] In order to promote an understanding of the principles of the invention, reference will now be made to the embodiments illustrated in the figures, and specific language will be used to describe them. It will nevertheless be understood that no limitation of the scope of the invention is thereby intended. Any further changes and modifications to the described embodiments, and any other applications of the principles of the invention as described in this document, are contemplated as would normally occur to someone skilled in the art to which the invention relates.
[0018] Echo cancellation is desirable for delivering telephone calls correctly in environments such as conference calls. The use of hands-free devices during telephone calls, such as conference calls, can cause an echo. For example, the speech of the far-end caller is emitted by the speakerphone, or by the hands-free cell phone, and then reverberates off the surfaces of the room. This results in an echo. The echo can then be picked up by the microphone and sent back to the far end. A loop can be created in which the caller at the far end hears an echo of his or her own voice. Delays of more than 1 second (s) have been observed in some situations, such as conference calls involving international participants.
[0019] Failure to remove, or cancel, the echo of a call can often result in a significant deterioration in the quality of the call. The uncontrolled and variable nature of acoustic and hybrid environments can result in complex echo patterns such as long delays, time-dependent echo effects, echo tails, frequency-dependent echo and echo distortion. For example, previous echo cancellation means usually cannot detect a very low-level echo that can occur depending on the network configuration.
[0020] An acoustic echo cancellation digital signal processing technique can be used to break this loop and allow clear communication. Networks, such as VoIP networks, are often noisy, with signals undergoing minimal to moderate degradation. Means for echo cancellation must operate in the presence of noise. Said means must also be able to take into account the packet loss and latency that occur in these networks. Operations performed by an echo canceller must be efficient, without adding any noticeable signal processing delay.
[0021] An echo canceller (EC) can function as a signal processing operation that eliminates echoes from signals received through communication networks, such as VoIP and public switched telephone networks (PSTNs), or at an endpoint, such as a telephone device, for example. Generally, an EC reduces the acoustic and hybrid echoes that arise in configurations such as conference calls with speakers in different environments. Acoustic echo is generated when a signal transmitted from a near-end speaker is picked up by the far-end speaker's microphone and returns to the near-end speaker as part of the far-end speaker's signal. The terms near end and far end are usually defined in relation to the EC under discussion, which may be operating at both ends of the communications network. Another source of echo may be hybrid echo, which is a reflection of electrical energy from the far end due to changes in the wiring properties of PSTNs.
[0022] Most existing echo cancellation methods operate in the time domain or use a Discrete Cosine Transform cross-correlation of the two signals to determine the delay. In at least one embodiment, the EC performs a statistical determination of the effective filter parameters, making them more robust in the presence of signal noise and long delays.
[0023] Echo cancellation can be performed on some systems by a dedicated microprocessor, for example, the Texas Instruments TMS320C8X, as the algorithm requires more than 10 million instructions per second. In a VoIP network, however, dedicated microprocessors cannot be used because the entire system resides on a server or a computer. With regard to VoIP networks, several problems must be considered, for example: the VoIP network adds its own delay on top of the normal delay associated with the echo signal; low bit rate codecs introduce signal compression artifacts, which can increase degradation; and the inherent unreliability of IP networks can result in packet loss. It is also desirable to handle multiple instances (for example, hundreds of full-duplex calls) of the echo canceller simultaneously on a single server.
[0024] Those skilled in the art will recognize from the present disclosure that the various methodologies disclosed in this document can be implemented on a computer using many different forms of data processing equipment. Equipment may include digital microprocessors and associated memory to run appropriate software programs, to name just one non-limiting example. The specific form of the hardware, firmware and software used to implement the presently disclosed embodiments is not critical to the present invention.
[0025] Figure 1 is a diagram illustrating an embodiment of echo in a communication network, generally indicated at 100. An example of a communication network can include, but is not limited to, a VoIP network. The signal transmitted from the near end, 105, is represented as TX. The signal received from the far end with additional echo, 125, is represented as RX. The network 110 through which the far-end signal 120 travels also transmits the acoustic echo 115. As TX 105 travels through the network 110, echo 115 is generated by the microphone of the far-end speaker and sent to the near-end speaker as part of the far-end speaker's signal. Double-talk may result from the presence of the echo signal in addition to the received far-end speech. Thus, the received signal 125 contains an echo 115.
[0026] Figure 2 is a diagram illustrating a typical embodiment of the operation of an echo cancellation system, generally indicated at 200. The near-end signal 210 can be represented by $x(n)$, while the far-end signal 250 can be represented by $y(n) + r(n)$. The near-end signal 210 can be generated by an audio input 205, an example of which can be a person speaking. The unwanted echo 216 can be represented as $r(n)$. The echo canceller uses the transmitted signal $x(n)$, 210, and the received signal $y(n) + r(n)$, 250, to estimate $r(n)$ so that the echo canceller can remove it. The signal can be superimposed with the unwanted echo 216 at microphone 255 after traveling through echo path 230 from speaker 215. A remote device 260 can contain microphone 255 and speaker 215. Remote device 260 can introduce echo into the system.
[0027] The near-end signal $x(n)$, 210, may be available as a reference signal for the echo canceller 200. It can be used by echo canceller 200 to generate an estimate of the echo 225, which is represented as $\hat{r}(n)$. The estimated echo is subtracted from the far-end signal plus echo to produce the transmitted far-end signal 240, $\hat{y}(n)$, during the echo removal phase 245. Thus, the transmitted far-end signal 240 can be represented as $\hat{y}(n) = y(n) + r(n) - \hat{r}(n)$. The echo estimator, or the NLMS adaptive filter 220 as illustrated, is needed to estimate $\hat{r}(n)$. Ideally, any residual signal, represented as $e(n) = r(n) - \hat{r}(n)$, should be very small or inaudible after echo cancellation when the signal reaches the audio output 235, an example of which can be a receiver.
[0028] The normalized least mean squares (NLMS) adaptive filter 220 can use an algorithm that is a variant of the least mean squares (LMS) algorithm and that takes into account the strength of the input signal 210. The LMS algorithm can be an adaptive algorithm that uses a gradient-based steepest-descent method. The adaptive filter adjusts its coefficients to minimize the mean square error between its output and that of an unknown system. Echo cancellation is performed in the time domain on a sample-by-sample basis.
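As a non-limiting illustration of the time-domain approach of Figure 2, the sample-by-sample NLMS update may be sketched as follows. The function name `nlms_echo_canceller`, the step size `mu`, the four-tap filter length and the synthetic echo path are assumptions made for this sketch and do not come from the disclosure.

```python
import random

def nlms_echo_canceller(reference, received, taps=4, mu=0.5, eps=1e-8):
    """Return (residual, weights) after sample-by-sample NLMS adaptation."""
    weights = [0.0] * taps
    history = [0.0] * taps               # most recent reference samples, newest first
    residual = []
    for x, d in zip(reference, received):
        history = [x] + history[:-1]
        estimate = sum(w * h for w, h in zip(weights, history))
        error = d - estimate             # what remains after subtracting the echo estimate
        power = sum(h * h for h in history) + eps
        step = mu * error / power        # dividing by input power is the "normalized" part
        weights = [w + step * h for w, h in zip(weights, history)]
        residual.append(error)
    return residual, weights

random.seed(0)
tx = [random.uniform(-1.0, 1.0) for _ in range(2000)]
# Hypothetical echo path: the reference delayed by 2 samples at half amplitude.
rx = [0.0, 0.0] + [0.5 * s for s in tx[:-2]]
residual, weights = nlms_echo_canceller(tx, rx)
```

In this noise-free toy case the third coefficient settles near 0.5 and the residual decays toward zero. The filter needs one coefficient per sample of delay it has to cover, which is the cost that grows with long echo delays.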
[0029] Echo delay occurs when the originally transmitted signal reappears in the transmitted or received signal. The echo delay of VoIP networks can become very large due to several factors. The network path 265 is one such factor that contributes to the length of the echo delay: a longer network path 265 can mean a longer echo delay. Delays of more than 1 s have been observed. In a time-domain implementation, these long echo delays would require the NLMS filter 220 to have a large number of coefficients (taps) to cancel the echo. Estimating such long filters requires computational effort that is excessively expensive and impractical.
[0030] Figure 3 is a diagram illustrating an embodiment of the operation of a modified echo cancellation system, generally indicated at 300. In this diagram, the NLMS adaptive filter 220 shown in Figure 2 is replaced by other components that may include: Fast Fourier Transform (FFT) Modules 305a and 305b, a Similarity Measure and Delay Estimation Module 310, a Statistical Echo Validation Module 315 and an Echo Parameter Estimation Module 320. Although the diagram shows the Echo Parameter Estimation Module 320, the Statistical Echo Validation Module 315 and the Similarity Measure and Delay Estimation Module 310 grouped into a single module 306, this is done for clarity and they do not need to be grouped as such. All operations in the present invention are performed in the frequency domain, using a Fast Fourier Transform (FFT) module 305a, 305b to convert the signal, rather than in the time domain as in Figure 2.
[0031] The Similarity Measure and Delay Estimation Module 310 uses a similarity measure that performs fewer operations than a classic NLMS algorithm. This replaces the extensive per-sample multiplications and additions that an adaptive NLMS filter would need in order to handle a delay of more than 1 s, as illustrated in Figure 2.
[0032] Echo delay can refer to the time it takes for the transmitted signal to reappear in the received signal. The delay estimation is performed using an algorithm that can detect an echo with a delay greater than 1 s and that allows the system to perform echo cancellation on many full-duplex calls on a single computer. In order to recognize an echo, in at least one embodiment, the most recent frames of the far-end signal are kept in the frequency domain. These frames, represented by $N$, with $N \approx 100$, can represent an audio signal block of about 1.5 s. The most recent frames of the near-end signal, represented in the frequency domain, are also kept. These frames, represented by $K$, with $K \approx 5$, can represent an audio signal block of about 80 milliseconds (ms). $N - K$ comparisons between the $K$ most recent frames of the far-end and near-end signals are examined as follows:
[0033] $W_T(l) = \dots$ (right-hand side not recoverable from the source text)
[0034] If $W_T(l)$ is less than a threshold for some $l \ge 1$, then an echo is present, where $l$ represents an index ranging from 1 to $N - K$, and where $i$ represents an index used in the sum.
[0035] Figure 4 is a diagram illustrating an embodiment of the similarity measure, generally indicated at 400. The echo and window tracking behavior can be determined dynamically, based on drift corrections and echo latency observed through a statistical model. In cases where the delay is not known, in at least one embodiment, the search may cover $N$ frames 410. In a non-limiting example, let $N$ be 100 frames, which may comprise about 1.5 s of the signal. Once the echo delay 415 is known, in order to reduce processing, only the area around the delay is searched instead of the entire original $N$ frames. Assuming that $D$ represents the echo delay, the restricted search area is reduced to $[D - M, D + M]$, in which $M$ defines the search interval and is equal to 10. This comprises about 160 ms of signal. In at least one embodiment, it is assumed that, once the echo delay is found, its value can vary within an interval of ± 160 ms. The computational load can therefore be reduced by a factor of three.
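The full and restricted searches described above may be sketched as follows, purely as an illustration. The function names, the oldest-first ordering of the frame lists, the 8-bin synthetic spectra and the use of an averaged absolute difference are assumptions of this sketch; the 0.6 threshold and the interval $M = 10$ are the example values given in this description.

```python
import random

def frame_distance(a, b):
    """Sum of absolute bin differences between two normalized spectra."""
    return sum(abs(x - y) for x, y in zip(a, b))

def similarity(tx_frames, rx_frames, offset):
    """Average distance between the K RX frames and the K TX frames starting at offset."""
    k = len(rx_frames)
    return sum(frame_distance(tx_frames[offset + i], rx_frames[i]) for i in range(k)) / k

def find_offset(tx_frames, rx_frames, threshold=0.6, known_offset=None, m=10):
    """Return the best-matching offset into the TX history, or None if nothing is similar enough."""
    last = len(tx_frames) - len(rx_frames)
    if known_offset is None:
        candidates = range(0, last + 1)                       # full search
    else:                                                     # restricted search [D - M, D + M]
        candidates = range(max(0, known_offset - m), min(last, known_offset + m) + 1)
    best = min(candidates, key=lambda l: similarity(tx_frames, rx_frames, l))
    return best if similarity(tx_frames, rx_frames, best) < threshold else None

def random_spectrum(bins=8):
    raw = [random.random() for _ in range(bins)]
    total = sum(raw)
    return [v / total for v in raw]

random.seed(1)
tx_history = [random_spectrum() for _ in range(100)]   # N = 100 frames of TX
rx_recent = tx_history[37:42]                          # K = 5 frames that echo TX exactly
full = find_offset(tx_history, rx_recent)
restricted = find_offset(tx_history, rx_recent, known_offset=35)
```

Here both searches locate offset 37, but the restricted search evaluates 21 candidates instead of 96.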
[0036] The similarity measurement and the reduction of the search area are performed in the frequency domain, so that each searched element can represent a Normalized Amplitude Frequency Vector. It should be noted that the sample values contained in this disclosure are specific to one implementation that works on signals with a sampling rate of 8 kHz, as used in telephony. These values would be adapted for other sampling rates. In at least one embodiment, the Normalized Amplitude Frequency Vector can be represented in 128 bins. The differences 420 between the K frames of the far-end speech (RX) 405, where RX is the far-end signal mixed with echo, and the N frames of the near-end speech (TX) 410 are measured and added together. For each frame, represented by $j$, the difference equation can be defined as:
[0037] $W_j = \sum_{k=1}^{128} \lvert X_j(k) - Y_j(k) \rvert$
[0038] where $X_j(k)$ and $Y_j(k)$ are, respectively, the amplitude values in bin $k$ for the near-end signal $X$ and the far-end signal $Y$ for frame $j$. Without loss of generality, this equation can be rewritten as:
[0039] $W_j = \sum_{n \in \{1, 33, 65, 97\}} \sum_{k=n}^{n+31} \lvert X_j(k) - Y_j(k) \rvert$
[0040] In this second equation, the value of $W_j$ is the same as in the first equation, except that the sum has been divided into smaller elements. These partial sums can be represented by the equation:
[0041] $W_j(n) = \sum_{k=n}^{n+31} \lvert X_j(k) - Y_j(k) \rvert$
[0042] Different numbers of elements can exist in different embodiments; however, increments of 32 are used in this instance. The partial sum described above, which covers the 32 bins starting at $n$, can be used in at least one embodiment instead of the sum described in the first equation, which ranges from 1 to 128.
[0043] In at least one embodiment, the similarity measure can be computed every 4 frames by accumulation of the partial sums:
[0044] (equation not recoverable from the source text)
[0045] The similarity measure that is used to calculate the delay is then updated every 4 frames, for example. This small delay allows the computational load to be reduced by a factor of 4, because 32 subtractions are performed each time instead of 128.
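As a hedged illustration of the band decomposition, the following sketch checks that four 32-bin partial sums add up to the full 128-bin difference, and shows one possible way to accumulate one band per frame over 4 frames. The staggering scheme in `staggered_similarity` is an assumption of this sketch, since the accumulation equation itself is not legible; all function names are hypothetical.

```python
import math

def full_difference(x, y):
    """128-bin sum of absolute differences for one frame pair."""
    return sum(abs(a - b) for a, b in zip(x, y))

def band_difference(x, y, start, size=32):
    """Partial sum over the `size` bins beginning at index `start` (0-based)."""
    return sum(abs(a - b) for a, b in zip(x[start:start + size], y[start:start + size]))

def staggered_similarity(x_frames, y_frames, size=32):
    """Assumed scheme: frame 0 contributes band 0, frame 1 contributes band 1, and so on."""
    return sum(band_difference(x, y, i * size, size)
               for i, (x, y) in enumerate(zip(x_frames, y_frames)))

x = [(k % 7) / 10.0 for k in range(128)]
y = [(k % 5) / 10.0 for k in range(128)]
by_bands = sum(band_difference(x, y, start) for start in (0, 32, 64, 96))
staggered = staggered_similarity([x] * 4, [y] * 4)
```

Each call to `band_difference` performs 32 subtractions, so a scheme that evaluates one band per frame does a quarter of the per-frame work of `full_difference`.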
[0046] In one embodiment, the band size can be 32 with a total of 4 bands, for a total of 128 bins. The band size can be changed so that 16 bands of size 8 are chosen for the same total of 128. Depending on the type of echo observed in a network, spectral bands may overlap. The size of each band can increase or decrease based on the desired system performance. Bands do not necessarily need to be adjacent. Spacing can also be used, such as taking every other band, for example. This is illustrated in the following equations:
[0047] (equation not recoverable from the source text)
[0048] (equation not recoverable from the source text)
[0049] Figure 5 is a diagram illustrating an embodiment of the components of the Similarity Module 310 of Figure 3, generally indicated at 500. The similarity module can transform the RX signal 505a and the TX signal 505b into the frequency domain. In at least one embodiment, the transformation is performed using a 128-bin FFT. Both spectra are normalized (that is, the sum of the components in each spectral vector is made equal to 1) 510a, 510b, for cases where the signal levels are different. When there is no double-talk present, the energy of the TX signal is greater than the energy of the RX signal. Band-pass filtering is performed to eliminate any spectral regions in the signal that are not desired for the similarity calculation 515. The similarity value 520 is then output from the module. In at least one embodiment, the similarity value, or similarity measure, 520 is defined as the distance between two spectral vectors (for example, of 128 bins) averaged over five RX and TX frames. A value of less than 0.6, where 0.6 is a fixed threshold, can indicate echo, in at least one embodiment.
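The effect of normalizing both spectra before comparing them may be illustrated with the following sketch. It uses a naive discrete Fourier transform in place of an FFT for brevity, single-tone test signals, and an absolute-difference distance; these choices and all names are assumptions of the sketch. The 8 kHz sampling rate, 128 bins and 0.6 threshold are the example values from this description.

```python
import cmath
import math

SAMPLE_RATE = 8000   # Hz, telephony rate used in this description
FRAME = 256          # samples per frame, giving 128 usable bins

def magnitude_spectrum(samples, bins=128):
    """Magnitudes of the first `bins` DFT coefficients (naive O(n^2) DFT for clarity)."""
    n = len(samples)
    return [abs(sum(s * cmath.exp(-2j * math.pi * k * t / n) for t, s in enumerate(samples)))
            for k in range(bins)]

def normalize(spectrum):
    """Scale the vector so that its components sum to 1."""
    total = sum(spectrum)
    return [v / total for v in spectrum] if total > 0 else list(spectrum)

def distance(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))

def tone(freq_hz, gain=1.0):
    return [gain * math.sin(2 * math.pi * freq_hz * t / SAMPLE_RATE) for t in range(FRAME)]

speech = normalize(magnitude_spectrum(tone(500)))
quiet_echo = normalize(magnitude_spectrum(tone(500, gain=0.1)))   # same content, 20 dB lower
other = normalize(magnitude_spectrum(tone(2000)))
```

Because of the normalization, the attenuated copy is at a distance near 0 from the original (below the 0.6 threshold), while the unrelated tone is at a distance near 2, the maximum for two vectors that each sum to 1.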
[0050] The similarity module can report the existence of echo for frames $l$, $l+2$ and $l+5$ because $W_T(l)$ is less than a certain threshold for these frames. The similarity module can also report that there is no echo for frames $l+1$, $l+3$ and $l+4$ because $W_T(l)$ is greater than the threshold for these frames.
[0051] Such oscillations may not be considered echo. To validate the presence of echo, the statistical approach in the Statistical Echo Validation Module 315 can be based on the following assumption in at least one embodiment: there is an echo if, for $N$ consecutive delays estimated by the Echo Delay Estimation and Similarity Measure module, $K$ of these $N$ delays have exactly the same value, with the ratio $K/N$ greater than 75%.
[0052] A histogram is analyzed to extract the most likely hypothesis from the current data, and it provides a more accurate estimate of the model parameters. With the approach described here, the echo delay can be determined in individual frequency bands or groups of bands instead of only as an average delay, by maintaining a histogram for each band or group of bands. By analyzing the histogram for multimodal distributions, multiple echoes can be successfully extracted and removed. Time-varying echoes can also be handled, as long as the history length is chosen so as to track the change in filter parameters. In one embodiment, the statistics of the model parameters are stored in a circular memory for the 20 most recent frames (320 ms), with the oldest values being removed as more recent data becomes available. Figure 6 provides an example histogram of a delay distribution. The histogram illustrates the distribution of 20 estimated delays and the process by which it is decided whether an echo is present. This histogram is provided as an example and is not intended to be limiting. Because 15 of the 20 delays, 605, are between 11 and 15 ms, the received signal may contain an echo with a delay of 12.5 ms. Likewise, in one embodiment, it can be determined that there is no echo, or that echo is no longer present in the RX signal, if fewer than 6 of the 20 estimated delays fall into the same bin. For example, 2 of the 20 delays are found to be between 1 and 5 ms, 610. This may indicate that there is no echo or that echo is no longer present.
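The histogram-based validation may be sketched as follows, as a non-limiting illustration. The class name `DelayValidator`, the 5 ms bin width (inferred from the 11 to 15 ms and 1 to 5 ms bins of the Figure 6 example) and the use of "at least 75%" so that 15 of 20 qualifies are assumptions of this sketch.

```python
from collections import Counter, deque

class DelayValidator:
    """Keeps the most recent delay estimates and confirms echo when one bin dominates."""

    def __init__(self, history=20, bin_width_ms=5, ratio=0.75):
        self.delays = deque(maxlen=history)   # circular memory: the oldest estimate drops out
        self.bin_width_ms = bin_width_ms
        self.ratio = ratio

    def add(self, delay_ms):
        self.delays.append(delay_ms)

    def validated_bin(self):
        """Return (low_ms, high_ms) of the dominant bin, or None if no echo is confirmed."""
        if len(self.delays) < self.delays.maxlen:
            return None
        bins = Counter((d - 1) // self.bin_width_ms for d in self.delays)
        index, count = bins.most_common(1)[0]
        if count / len(self.delays) < self.ratio:
            return None
        low = index * self.bin_width_ms + 1
        return (low, low + self.bin_width_ms - 1)

validator = DelayValidator()
for delay in [11, 12, 13, 14, 15] * 3 + [1, 3, 22, 27, 40]:
    validator.add(delay)
echo_bin = validator.validated_bin()
```

With 15 of the 20 estimates falling between 11 and 15 ms, `echo_bin` is `(11, 15)`. Keeping one such validator per frequency band would be one way to track band-specific delays.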
[0053] In at least one embodiment, the nature of the filtering to be applied to the far-end signal needs to be defined so that the result resembles the echo present in the signal. The filter can be a model of the speaker, the microphone and the acoustic attributes of the room. Because the system operates in the frequency domain, echo parameters can be maintained to simulate the filter characteristics in the Echo Parameter Estimation Module 320 (Figure 3).
[0054] The echo return loss can be described as the ratio between the level of the transmitted signal (TX) and the level of echo present in the received signal (RX). It is expressed in decibels (dB). Knowing the echo return loss per frequency bin allows the near-end (TX) signal to be weighted correctly for echo removal. To make it similar to the echo, the normalized transfer function must also be evaluated in the frequency domain.
[0055] The filter used to modify the signal in order to obtain a reasonable estimate of the echo is characterized by the echo return loss and the normalized transfer function in at least one embodiment. The FFT of the far-end signal for frame number $k$ can be represented by $Y_k$, the echo return loss by $\mathrm{ERL}$ and the normalized transfer function between the far-end signal and the near-end signal by $H$. Using digital signal processing, $\mathrm{ERL}$ is evaluated in dB, which is represented by:
[0056] (equation not recoverable from the source text)
[0057] where $X$ is the FFT of the near-end signal. The modified, or filtered, far-end signal can therefore be given by the equation:
[0058] $\hat{Y}_k = \mathrm{ERL} \cdot H \cdot X_k$
[0059] If the delay $D$ is taken into account, the output $\hat{Y}$ can be represented as:
[0060] $\hat{Y}_k = \mathrm{ERL} \cdot H \cdot X_{k-D}$
[0061] The time-domain signal $\hat{y}$ is obtained by Inverse Fast Fourier Transform (IFFT):
[0062] $\hat{y} = \mathrm{IFFT}(\hat{Y}_k)$
[0063] This operation results in a signal block of 256 samples, which is overlapped with and added to the previous block to form the output signal.
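The overlap-and-add step may be illustrated by the following sketch. The hop of 128 samples (50% overlap) and the periodic Hann window are assumptions of this sketch; the description states only that 256-sample blocks are overlapped and added.

```python
import math

def overlap_add(blocks, hop=128):
    """Sum equal-length blocks, each starting `hop` samples after the previous one."""
    size = len(blocks[0])
    out = [0.0] * (hop * (len(blocks) - 1) + size)
    for index, block in enumerate(blocks):
        start = index * hop
        for offset, value in enumerate(block):
            out[start + offset] += value
    return out

def hann(size=256):
    """Periodic Hann window; two copies shifted by size/2 sum to exactly 1."""
    return [0.5 - 0.5 * math.cos(2 * math.pi * n / size) for n in range(size)]

window = hann()
# Four windowed blocks of a constant signal of value 1.0.
blocks = [list(window) for _ in range(4)]
output = overlap_add(blocks)
```

Away from the first and last half-blocks, the windowed blocks sum back to the original constant, which is why a window with this property avoids amplitude ripple at block boundaries.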
[0064] Figure 3 further illustrates speech detection modules as applied to at least one embodiment of echo canceller 300. Speech detection must be run before it is determined whether to evaluate different echo parameters. Three speech detectors are illustrated in Figure 3: the RX speech detector 330, the TX speech detector 335 and the double-talk detector 340. Speech detection is based on the variability of the spectrum across consecutive frames and on the estimated signal strength. Detectors are generally designed around the principle that, if the signal level is higher than a certain threshold, then it is reasonable to assume that speech is present. Combining the signal level with the variation in the spectrum over several blocks gives greater precision and robustness to the system.
[0065] The RX speech detector 330 makes no distinction between the echo and the far-end speaker. RX speech can mean that the far-end speaker is speaking or that an echo is present. Because the echo level can be relatively low, the RX speech detector may be more sensitive than the other two speech detectors. RX speech can be considered present if the level of the far-end speech is greater than the far-end speech threshold or if the spectral variation of the far-end speech is greater than the spectral variation threshold. The value of these thresholds should be chosen such that the speech detector activates on low-level echoes while minimizing false activations on background noise. If the thresholds are too small, the background noise picked up by the microphone can result in false detection. If the threshold value is too large, part of the speech or part of the echo may not be detected.
[0066] The TX speech detector 335 can trigger a search for the presence of echo. The search can be triggered by the activity of the near-end speaker. Near-end speech can be considered present if the level of the near-end speech is greater than the near-end speech threshold or if the spectral variation of the near-end speech is greater than the spectral variation threshold. These thresholds may have higher values than those for far-end speech.
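The level-or-variation rule used by the RX and TX speech detectors may be sketched as follows. The class name and every threshold value are hypothetical, chosen only so that the RX detector is the more sensitive of the two, as described above.

```python
from dataclasses import dataclass

@dataclass
class SpeechDetector:
    level_threshold_db: float
    variation_threshold: float

    def detect(self, level_db, previous_spectrum, current_spectrum):
        """Speech is assumed present if the level OR the frame-to-frame spectral change is high."""
        variation = sum(abs(a - b) for a, b in zip(previous_spectrum, current_spectrum))
        return level_db > self.level_threshold_db or variation > self.variation_threshold

# Hypothetical thresholds: RX is more sensitive so that low-level echo still triggers it.
rx_detector = SpeechDetector(level_threshold_db=-50.0, variation_threshold=0.2)
tx_detector = SpeechDetector(level_threshold_db=-40.0, variation_threshold=0.4)

steady = [0.25, 0.25, 0.25, 0.25]
changed = [0.7, 0.1, 0.1, 0.1]
rx_hit = rx_detector.detect(-45.0, steady, steady)    # level alone exceeds the RX threshold
tx_miss = tx_detector.detect(-45.0, steady, steady)   # too quiet and too steady for TX
```

In practice the thresholds would be tuned on recorded calls, trading missed low-level echo against false activations on background noise.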
[0067] The double-talk detector 340 can determine whether both far-end and near-end speech are present at the same time. Accurate double-talk detection in the presence of echo is necessary so that parameters are not changed based on a similarity calculation that may no longer be valid. Double-talk detection makes it possible to control the amount of echo removed while speech is present, in at least one embodiment. A signal 3 dB above the echo is usually considered to be an indicator of double-talk. It is assumed that double-talk is present if far-end speech is present, near-end speech is present, and the far-end speech level is greater than the echo level plus 3 dB.
[0068] The similarity measure is also used within the system to measure the similarity between TX and RX at the appropriate delay, to account for situations in which an echo can be louder and thus decrease the reliability of the detection. For example, two distinct echo levels can be present in a conference call. Speaker 1 can speak loudly and therefore produce a high level of echo. Speaker 2 can speak quietly, and the lower voice can result in a lower echo level than that of speaker 1. The similarity value in the presence of double-talk is thus greater than in the case where there is only an echo. In at least one embodiment, a hysteresis in similarity values between 0.65 and 0.85 is used to check for double-talk in addition to the 3 dB restriction.
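One way to combine the 3 dB level rule with the 0.65 to 0.85 similarity hysteresis is sketched below. How the two conditions interact is not spelled out in this description, so the combination logic, the class name and the argument names are all assumptions of the sketch.

```python
class DoubleTalkDetector:
    def __init__(self, margin_db=3.0, low=0.65, high=0.85):
        self.margin_db = margin_db
        self.low = low
        self.high = high
        self.active = False

    def update(self, far_speech, near_speech, far_level_db, echo_level_db, similarity):
        """Return True while double-talk is considered present."""
        level_ok = (far_speech and near_speech
                    and far_level_db > echo_level_db + self.margin_db)
        if similarity >= self.high:
            similarity_state = True          # spectra clearly differ: more than echo alone
        elif similarity <= self.low:
            similarity_state = False         # spectra match: echo only
        else:
            similarity_state = self.active   # inside the hysteresis band: keep previous state
        self.active = level_ok and similarity_state
        return self.active

detector = DoubleTalkDetector()
states = [
    detector.update(True, True, -20.0, -30.0, 0.90),   # enters double-talk
    detector.update(True, True, -20.0, -30.0, 0.70),   # held by the hysteresis
    detector.update(True, True, -20.0, -30.0, 0.60),   # leaves double-talk
    detector.update(True, True, -20.0, -30.0, 0.70),   # stays off until 0.85 is crossed again
]
```

The hysteresis keeps the decision from flickering when the similarity value hovers between the two limits.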
[0069] As illustrated in Figure 7, an embodiment of a process for echo cancellation is provided, generally indicated at 700. Process 700 can be operative on any or all elements of system 300 (Figure 3). Echo cancellation itself can be defined as a subtraction in the frequency domain between the near-end signal and the estimated echo as:
[0070] $P = Y - \hat{Y}$
[0071] with
[0072] $\hat{Y} = \mathrm{ERL} \cdot H \cdot X$
[0073] In step 705, the echo model parameters are initialized. For example, initialization can be triggered by the signal being transformed from the time domain to the frequency domain using an FFT. Control is passed to step 710 and process 700 continues.
[0074] In step 710, the audio is analyzed for the presence of speech. Control is passed to step 715 and process 700 continues.
[0075] In step 715, it is determined whether or not speech is detected. If it is determined that speech has been detected, then control is passed to step 720 and process 700 continues. If it is determined that speech was not detected, then control is passed to step 710 and process 700 continues.
[0076] The determination in step 715 can be made based on any suitable criteria. For example, speech detection is performed by the TX speech detector, the RX speech detector and the double-talk detector (as described above for Figure 3). Detectors are generally designed around the principle that, if the signal level is higher than a certain threshold, then it is reasonable to assume that speech is present. The value of these thresholds must be chosen carefully from the analysis of data collected from typical use cases of the echo canceller. If the thresholds are too small, the background noise picked up by the microphone can result in false detection. If the threshold value is too large, part of the speech or part of the echo may not be detected. Near-end speech can be considered present if the level of the near-end speech is greater than the near-end speech threshold or if the spectral variation of the near-end speech is greater than the spectral variation threshold. Far-end speech can be considered present if the level of the far-end speech is greater than the far-end speech threshold or if the spectral variation of the far-end speech is greater than the spectral variation threshold. Thresholds may be higher than those for near-end speech.
[0077] In step 720, the echo delay is estimated and the echo model is validated. For example, the algorithm described above is used to estimate the delay. The validation of the echo model is statistical and can be based on the assumption that there is an echo if, for $N$ consecutive delays estimated by the Echo Delay Estimation and Similarity Measure module, $K$ of these $N$ delays have exactly the same value, with the ratio $K/N$ greater than 75%. Control is passed to step 725 and process 700 continues.
[0078] In step 725, it is determined whether the echo is present or not. If it is determined that the echo has been detected, then the control is passed to step 730 and process 700 continues. If it is determined that an echo has not been detected, then control is passed to step 710 and process 700 continues.
[0079] The determination in step 725 can be made based on any suitable criteria. For example, the algorithms described above can be used, together with the statistical analysis described above, to determine whether or not echo is detected.
[0080] In step 730, it is determined whether or not double-talk is present. If it is determined that double-talk is present, then control is passed to step 735 and process 700 continues. If it is determined that double-talk was not detected, then control is passed to step 740 and process 700 continues.
[0081] The determination in step 730 can be made based on any suitable criteria. For example, during gibberish, in order to avoid any degradation of the signal while the person at the far end is speaking, the echo can be multiplied by an attenuation factor α with 0 < α < 1. The output is then defined by:
[0082]
[0083] The constant α controls the amount of echo that is removed during gibberish. If α = 0, no echo is removed at all during gibberish, which is generally the case in most systems. A value of α = 0.5 during gibberish and 1 at other times allows better control over the system. In at least one embodiment, a signal level 3 dB above the echo level is taken as an indicator of gibberish: gibberish is presumed to be present if far-end speech is present, near-end speech is present, and the far-end speech level is greater than the echo level plus 3 dB. One reason to change the amount of echo removed during gibberish is to avoid or reduce audible artifacts in the signal after echo removal.
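The following sketch illustrates the gibberish test and the attenuated subtraction. The output expression is not reproduced in this text, so the form used here (signal minus the attenuation factor, written `alpha`, times the echo estimate) is an assumption, as are all function and parameter names. The 3 dB margin and the values 0.5 and 1 come from the paragraph above.

```python
def is_gibberish(far_end_speech, near_end_speech,
                 far_end_level_db, echo_level_db, margin_db=3.0):
    """Gibberish is presumed when both talkers are active and the speech
    level exceeds the echo level by more than margin_db."""
    return (far_end_speech and near_end_speech
            and far_end_level_db > echo_level_db + margin_db)

def remove_echo(signal, echo_estimate, gibberish, alpha_gibberish=0.5):
    """Subtract the echo estimate, scaled by alpha (assumed output form).

    alpha is alpha_gibberish during gibberish and 1.0 otherwise."""
    alpha = alpha_gibberish if gibberish else 1.0
    return [s - alpha * e for s, e in zip(signal, echo_estimate)]
```

Setting `alpha_gibberish=0.0` reproduces the behaviour the text attributes to most systems, in which no echo is removed during gibberish.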
[0084] In step 735, parameters are computed for echo in the presence of gibberish. The control is then passed to step 745 and the process 700 continues.
[0085] In step 740, parameters are computed for echo in the absence of gibberish. The control is then passed to step 745 and process 700 continues.
[0086] In step 745, echo subtraction is performed. Once the delay has been accurately determined, the echo is canceled by applying a transfer function to the RX signal. The transfer function is the ratio between the TX and RX signals in the frequency (spectral) domain and can be represented as (TX / RX).
[0087] This ratio is obtained from the histograms by choosing the one corresponding to the most likely delay value. Figure 6 is an embodiment of the histograms, generally indicated at 600. In at least one embodiment, echo cancellation is performed in the frequency domain, using a statistical approach that is more effective for long delays and multiple echoes. Operating in the frequency domain eliminates the need for adaptive filtering, which requires extensive computation to calculate the filter coefficients, other non-linear operations to completely remove the echo, and a settling time for the filter coefficients to converge. Control is passed to step 750 and process 700 continues.
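A minimal sketch of frequency-domain echo subtraction with a TX/RX transfer function follows. It assumes that spectra are given as equal-length lists of complex bin values, that the RX frame passed in is already aligned to the estimated delay, and that the transfer function was measured on a frame containing only echo. The function names and the zero-bin guard `eps` are hypothetical.

```python
def estimate_transfer_function(tx_spectrum, rx_spectrum, eps=1e-12):
    """Per-bin ratio TX / RX; bins where RX is (near) zero are set to 0."""
    return [t / r if abs(r) > eps else 0j
            for t, r in zip(tx_spectrum, rx_spectrum)]

def subtract_echo(tx_spectrum, rx_spectrum, transfer, alpha=1.0):
    """Apply the transfer function to RX to estimate the echo,
    then subtract the (optionally attenuated) estimate from TX."""
    return [t - alpha * h * r
            for t, h, r in zip(tx_spectrum, transfer, rx_spectrum)]
```

Because the estimate is a single per-bin ratio rather than a set of adaptive filter coefficients, nothing in this sketch has to converge before it can be applied.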
[0088] In step 750, the echo is controlled and the echo model is updated. Control is passed to step 755 and process 700 continues.
[0089] In step 755, it is determined whether the echo is still present or not. If it is determined that the echo is still present, then the control is passed back to step 730 and the process 700 continues. If it is determined that echo is not present, then control is passed back to step 705 and process 700 continues.
[0090] The determination in step 755 can be made based on any appropriate criteria, such as the methods as described above. As the control is passed back to step 705, the parameters are reset in the echo model and the process continues.
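The control flow of steps 720 to 755 described above can be summarized as a small transition function. This is a sketch only: the function name and boolean parameters are hypothetical, and steps 705 and 710 appear solely as return targets.

```python
def next_step(step, echo_present=False, gibberish_present=False):
    """Return the step that follows `step` in process 700 (steps 720 to 755)."""
    if step == 725:   # echo detected? otherwise return to step 710
        return 730 if echo_present else 710
    if step == 730:   # gibberish detected?
        return 735 if gibberish_present else 740
    if step == 755:   # echo still present? otherwise reset at step 705
        return 730 if echo_present else 705
    unconditional = {720: 725, 735: 745, 740: 745, 745: 750, 750: 755}
    return unconditional[step]
```

Starting from step 725 with an echo and no gibberish, repeated calls visit 730, 740, 745, 750 and 755 in that order.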
[0091] In at least one embodiment, echo cancellation is necessary in interactive voice response (IVR) systems that use automatic speech recognition (ASR). Echo cancellation plays an important role in preventing the echo of an automated message played back to the caller from triggering the speech detector of the ASR engine. If echo is present, it would cause repeated interruptions of the automated message and thus a bad user experience. Such an echo, if not canceled, can be perceived by the system as a response from a user, which can trigger a false interaction.
[0092] Figure 8 is an illustration of an embodiment of the convergence time 805 in an echo canceller, generally indicated at 800. Because the echo signal level 810 is still relatively high during convergence, it can trigger the speech detector of the ASR engine, which mistakes the echo for a user response. To prevent this, in one embodiment, the echo canceller output is delayed by the expected number of frames required to detect the presence of echo (the convergence time). If an echo is detected, it is removed retroactively from the buffered frames, which are then sent to the ASR engine. Because the convergence time in the present invention is short, the introduced delay does not appreciably compromise the user experience of the voice dialogue. To further reduce the perceived delay, at least one embodiment stops automated messages ("interruption") based on a speech activity signal derived by the present invention from echo canceller state information rather than from the speech detector in the ASR engine. In another embodiment, buffered frames from which the echo was removed retroactively can be fed to the downstream consumer (such as an ASR engine) faster than real time to reduce or eliminate the delay for subsequent speech frames.
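The delayed-output idea can be sketched with a small frame buffer. Everything here is a simplified assumption: frames are plain numbers, `clean` stands in for whatever echo removal is applied to one frame, and `echo_detected_at` is the index of the frame at which detection completes.

```python
from collections import deque

def delayed_stream(frames, hold_frames, echo_detected_at, clean):
    """Release frames hold_frames late; once the echo is detected, clean the
    frames still held (retroactive removal) and every later frame."""
    held = deque()
    released = []
    echo_known = False
    for index, frame in enumerate(frames):
        held.append(frame)
        if index == echo_detected_at:
            echo_known = True
            held = deque(clean(f) for f in held)  # retroactive removal
        elif echo_known:
            held[-1] = clean(frame)
        if len(held) > hold_frames:
            released.append(held.popleft())
    released.extend(held)  # flush the remaining frames at end of stream
    return released
```

If detection completes while the first frame is still held, no frame containing echo reaches the downstream consumer. If detection takes longer than the hold time, the earliest frames are released before they can be cleaned, which is why the delay is matched to the expected convergence time.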
[0093] Figure 9 is an example illustration of an echo cancellation embodiment with little or no convergence time, generally indicated at 900. The overall output can become more uniform, as shown. In at least one embodiment, the value T reflecting the convergence time is equal to 150 ms.
[0094] In at least one embodiment, acoustic echo over a PSTN network would generally not show a delay greater than 500 ms. In VoIP networks, however, delays can be longer than that. Figure 10 illustrates an embodiment of echo through a VoIP network, generally indicated at 1000, and the communication between two telephones 1005a, 1005b. The audio signal passes through the network 1015 to travel between telephone 1005a and telephone 1005b. The network 1015 can also be connected to other devices, such as, but not limited to, a computer 1010a, 1010b. Other examples of devices may include servers, fax machines, etc. The network introduces its own audio disturbances 1020, such as packet loss 1025, delay 1030 and jitter 1035.
[0095] Delay 1030 specifies the amount of time it can take for data to travel across the network from one point to another. Known sources of delay include processing delay, queuing delay, transmission delay and propagation delay. Processing delay can be the time that routers take to process the packet. Queuing delay can be the time that the packet spends in the routing queues. Transmission delay can be the time it takes to put the packet onto the link. Propagation delay can include the time it takes for a signal to reach its destination. The sum of all these delays, which represents the total delay, can be added to the actual echo delay to form the final echo delay across the network. The total delay found can easily exceed 1 s. The present invention can handle delays much longer than 1 s.
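As a worked illustration of how the components add up, the short function below sums the four network delay sources and the echo delay itself. The function name and the figures in the usage note are hypothetical values chosen for the example, not measurements from the specification.

```python
def total_echo_delay_ms(processing, queuing, transmission, propagation,
                        echo_delay):
    """Final echo delay: the four network delay components plus the
    actual echo delay, all in milliseconds."""
    network_delay = processing + queuing + transmission + propagation
    return network_delay + echo_delay
```

With 20 ms of processing, 150 ms of queuing, 30 ms of transmission, 600 ms of propagation and a 250 ms echo delay, the final echo delay is 1050 ms, which already exceeds 1 s.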
[0096] Another disturbance introduced by the network 1015 is jitter 1035. In at least one embodiment, jitter 1035 measures the variation in latency across the network, which can appear as a substantial variation in the delay seen by the echo cancellation algorithm. These sudden variations in delay introduced by jitter 1035 are difficult to handle and may temporarily cause the algorithm to lose track of the echo. The search interval mechanism for the echo delay makes it possible to handle echo with a long delay, and the restricted search compensates for the effect of jitter 1035 once the echo has been found. If an echo is found, the echo search can then take place over an interval of ±250 ms. If the jitter 1035, or variation in latency across the network 1015, is greater than ±250 ms, the search for the echo delay starts over across the 1.5 s interval.
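The two search regimes can be sketched as a window-selection function. The 1.5 s and ±250 ms figures come from the paragraph above; the names, the millisecond units and the clamping of the window to the full search range are assumptions made for the illustration.

```python
FULL_SEARCH_MS = 1500   # full search interval (1.5 s)
TRACK_MS = 250          # restricted search half-width around a known echo

def search_window(last_delay_ms, echo_found):
    """Return (low, high) bounds in ms for the next echo delay search."""
    if echo_found and last_delay_ms is not None:
        low = max(0, last_delay_ms - TRACK_MS)
        high = min(FULL_SEARCH_MS, last_delay_ms + TRACK_MS)
        return low, high
    return 0, FULL_SEARCH_MS

def still_tracking(window, measured_delay_ms):
    """False means the delay moved outside the restricted window,
    so the search should start over across the full interval."""
    low, high = window
    return low <= measured_delay_ms <= high
```

An echo found at 600 ms is then tracked between 350 ms and 850 ms; a jump to 900 ms falls outside that window and triggers a fresh search from 0 to 1500 ms.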
[0097] Another common signal degradation introduced by the network is packet loss 1025. Packet loss 1025 can occur when one or more data packets traveling through the network 1015 fail to reach their destination. Packet loss 1025 can be caused by a number of factors, such as signal degradation across the network due to multipath fading, packets dropped because of channel congestion, or corrupted packets discarded in transit.
[0098] To deal with packet loss 1025, the echo detection process needs to be robust and cannot rely solely on a simple similarity measurement. In at least one embodiment, the use of statistics through the histogram method makes the system robust to packet loss, because the decision is based on information accumulated over several data frames. A few frames in the search window affected by packet loss will not normally change the statistics to the point at which the system loses track of the echo.
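The robustness argument can be illustrated with a fixed-size history of per-frame delay estimates, in the spirit of the circular memory mentioned in claim 15. The class name, the history size and the use of the most frequent value as the decision are assumptions made for this sketch.

```python
from collections import Counter, deque

class DelayHistogram:
    """Keep the last `size` per-frame delay estimates and report the
    most frequent one."""

    def __init__(self, size):
        self.history = deque(maxlen=size)  # oldest entries drop out

    def add(self, delay_ms):
        self.history.append(delay_ms)

    def most_likely(self):
        if not self.history:
            return None
        return Counter(self.history).most_common(1)[0][0]
```

If three of twenty frames are corrupted by packet loss and yield stray delay values, the remaining seventeen still agree, so the reported delay does not change.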
[0099] In at least one embodiment, the calculation of the similarity and of the model parameters for different histories makes use of previous partial values for overlapping frames at earlier instants. An accurate delay value is calculated only if an echo is present. Once an echo has been determined, and as long as the echo characteristics do not change over time, the calculations needed to determine the delay are not repeated, although echo cancellation still needs to be performed with the delay estimate locked in. If the echo characteristics change over time, the echo canceller (EC) unlocks the delay estimate and a new round of model parameters is evaluated. The disappearance of the echo causes a reset of the model parameters, and the echo canceller automatically reduces the number of operations. These optimizations considerably reduce the number of computational operations performed by the EC.
[00100] In other embodiments, if multiple echoes are present in the received signal (RX), the delay histogram has several peaks. Estimates for the separate echoes can be made, and the echoes can be subtracted in sequence in the same way. Overlapping echo bands may require that the separate transfer functions be merged to avoid distortion from one echo cancellation to the other.
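Finding several peaks in the delay histogram can be sketched as follows. The rule used to call a value a peak (its share of all estimates is at least `min_share`) is an assumption for illustration; the specification does not state how peaks are selected.

```python
from collections import Counter

def echo_delays(delay_estimates, min_share=0.25):
    """Return the sorted delay values whose share of all estimates is at
    least min_share; each one is treated as a separate echo."""
    counts = Counter(delay_estimates)
    total = len(delay_estimates)
    return sorted(d for d, c in counts.items() if c / total >= min_share)
```

Each returned delay would then get its own transfer function estimate, and the corresponding echoes would be subtracted one after another.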
[00101] In at least one embodiment, the similarity calculation can be optimized by focusing on bands of interest if the near-end and far-end signals have their spectral density concentrated in specific regions. This significantly reduces computational overhead due to the highly repetitive nature of the similarity calculation across the entire far-end channel, an aspect that can become very important when searching for long delays.
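Restricting the similarity calculation to bands of interest can be sketched as below. The similarity expression itself is not legible in this text, so the sum of absolute differences between per-bin magnitudes is an assumed stand-in, and the function and parameter names are hypothetical.

```python
def band_difference(near_spectrum, far_spectrum, bands):
    """Sum of absolute per-bin differences, evaluated only on the bin
    indices listed in `bands` (smaller means more similar)."""
    return sum(abs(near_spectrum[k] - far_spectrum[k]) for k in bands)
```

Evaluating only the selected bins reduces the work per candidate delay in proportion to the number of bins skipped, which matters most when many candidate delays have to be searched.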
[00102] Although the invention has been illustrated and described in detail in the figures and the description above, these are to be considered illustrative and not restrictive in character, it being understood that only the preferred embodiment has been shown and described and that all equivalents, changes and modifications that come within the spirit of the inventions as described in this document and/or by the following claims are intended to be protected.
[00103] Therefore, the proper scope of the present invention should be determined only by the broadest interpretation of the appended claims, so as to cover all such modifications as well as all relationships equivalent to those illustrated in the figures and described in the specification.
[00104] Although two very narrow claims are presented in this document, it should be recognized that the scope of the present invention is much broader than that presented by the claims. It is intended that broader claims be submitted in an application that claims the benefit of priority from this application.

Claims:
Claims (15)
[1]
1. Acoustic echo cancellation method, characterized by the fact that it comprises the steps of:
a. transforming an audio signal from the time domain to the frequency domain using the Fast Fourier Transform;
b. initializing parameters of the echo model;
c. analyzing said audio signal for speech;
d. detecting the presence of speech in the audio signal;
e. estimating the echo delay in the audio signal and validating said echo model for the audio signal;
f. detecting the presence of echo in the audio signal;
g. detecting the presence of gibberish in the audio signal;
h. determining parameters for at least one of: echo with gibberish and echo;
i. subtracting the echo from the audio signal;
j. updating said echo model; and
k. determining whether the presence of echo is reduced in the audio signal.
[2]
2. Method according to claim 1, characterized by the fact that step (k) further comprises the step of repeating the method of claim 1 if it is determined that the echo is reduced.
[3]
3. Method according to claim 1, characterized by the fact that step (k) additionally comprises the step of repeating the method of claim 1 from step (f) if it is determined that the echo is not reduced.
[4]
4. Method, according to claim 1, characterized by the fact that said transformation is carried out from the time domain to the frequency domain.
[5]
5. Method, according to claim 1, characterized by the fact that the step of analyzing the audio signal for speech additionally comprises the step of determining whether the signal level in the audio signal meets a threshold.
[6]
6. Method, according to claim 1, characterized by the fact that the step of analyzing the audio signal for speech additionally comprises the step of analyzing the variability of the spectrum across consecutive frames and of an estimated signal power.
[7]
7. Method, according to claim 1, characterized by the fact that the step of detecting the presence of speech is performed by one or more of: near-end speech detector, far-end speech detector and gibberish detector.
[8]
8. Method according to claim 1, characterized by the fact that the step of estimating echo delay and validating an echo model additionally comprises the steps of:
l. measuring and adding a distance for each audio frame;
m. estimating the echo delay mathematically; and
n. validating the estimate using a statistical method.
[9]
9. Method, according to claim 8, characterized by the fact that the step of estimating the echo delay is performed using a mathematical expression for the measurement of similarity, in which the mathematical expression comprises:
Difference(i) = [...], with i = 1, ..., N - [...]
[10]
10. Method according to claim 9, characterized in that the step of determining whether the echo is present additionally comprises the step of determining that the echo is present, if the similarity measurement meets a threshold.
[11]
11. Method according to claim 1, characterized by the fact that the step of detecting the presence of gibberish additionally comprises the step of determining that a signal level above the level of echo is present, near-end speech is present, and far-end speech is present.
[12]
12. Method according to claim 11, characterized by the fact that the signal level is 3 dB.
[13]
13. Method according to claim 1, characterized in that the step of subtracting the echo additionally comprises the step of applying a transfer function to a far-end signal.
[14]
14. Method according to claim 13, characterized by the fact that the transfer function is determined as a function of the ratio of a near-end signal to a far-end signal in a spectral domain and by analyzing a histogram.
[15]
15. Method, according to claim 14, characterized by the fact that the analysis of a histogram comprises the steps of:
o. storing model parameter statistics in a circular memory for a number of frames;
p. determining an echo delay using frequency bands;
q. analyzing the histogram for multimodal distributions; and
r. extracting echo where a multimodal distribution is present.

Similar technologies:

Publication number | Publication date | Patent title

AU2017245314B2|2019-04-11|System and method for acoustic echo cancellation

US8098812B2|2012-01-17|Method of controlling an adaptation of a filter

US7388954B2|2008-06-17|Method and apparatus for tone indication

US8139760B2|2012-03-20|Estimating delay of an echo path in a communication system

US20040001450A1|2004-01-01|Monitoring and control of an adaptive filter in a communication system

US7366118B2|2008-04-29|Echo cancellation

CN103516921A|2014-01-15|Method for controlling echo through hiding audio signals

KR20080026073A|2008-03-24|An efficient voice activity detector to detect fixed power signals

US9219456B1|2015-12-22|Correcting clock drift via embedded sin waves

Bershad et al.2005|Fast coupled adaptation for sparse impulse responses using a partial Haar transform

US20030235295A1|2003-12-25|Method and apparatus for non-linear processing of an audio signal

US8009825B2|2011-08-30|Signal processing

US7856098B1|2010-12-21|Echo cancellation and control in discrete cosine transform domain

US7711107B1|2010-05-04|Perceptual masking of residual echo

CN106716853B|2021-04-09|Method and arrangement in a DSL vectoring system

Trump2008|Detection of echo generated in mobile phones using pitch distance

EP1944877B1|2011-03-16|Method of modifying a residual echo

Kauffman2006|An algorithm to evaluate the echo signal and the voice quality in VoIP networks

Ferreira et al.2011|Echo Cancellation for Hands-Free Systems

Tangwongsan et al.2010|Echo cancellation in VoIP with improved least square lattice method

Gordy2007|Robust echo cancellation in harsh environments

Nguyen Ngoc et al.2009|Implementation of the LMS and NLMS algorithms for Acoustic Echo Cancellation in teleconference system using MATLAB

Kmita2011|Echo cancellation in VoIP using digital adaptive filters

Cole et al.1986|A high performance digital voice echo canceller on a single TMS32020

TW200417167A|2004-09-01|Communication system and method therefor

Patent family:

Publication number | Publication date

NZ706162A|2018-07-27|

EP2912833A1|2015-09-02|

AU2017245314A1|2017-11-02|

AU2013334829A1|2015-04-09|

AU2017203053A1|2017-06-01|

CA2888894A1|2014-05-01|

WO2014066367A1|2014-05-01|

WO2014066367A8|2014-07-31|

CA2888894C|2021-08-17|

EP2912833A4|2016-06-22|

AU2017203053B2|2017-07-13|

JP2016502779A|2016-01-28|

CL2015001037A1|2015-08-21|

AU2013334829B2|2017-06-15|

CA3073412A1|2014-05-01|

US9628141B2|2017-04-18|

US20140112467A1|2014-04-24|

JP6291501B2|2018-03-14|

AU2017245314B2|2019-04-11|

EP2912833B1|2017-06-21|

Cited documents:

Publication number | Filing date | Publication date | Applicant | Patent title

US5450484A|1993-03-01|1995-09-12|Dialogic Corporation|Voice detection|

SG71035A1|1997-08-01|2000-03-21|Bitwave Pte Ltd|Acoustic echo canceller|

DE19831320A1|1998-07-13|2000-01-27|Ericsson Telefon Ab L M|Digital adaptive filter for communications system, e.g. hands free communications in vehicles, has power estimation unit recursively smoothing increasing and decreasing input power asymmetrically|

US6792107B2|2001-01-26|2004-09-14|Lucent Technologies Inc.|Double-talk detector suitable for a telephone-enabled PC|

US7236929B2|2001-05-09|2007-06-26|Plantronics, Inc.|Echo suppression and speech detection techniques for telephony applications|

US7155018B1|2002-04-16|2006-12-26|Microsoft Corporation|System and method facilitating acoustic echo cancellation convergence detection|

GB2389286A|2002-05-28|2003-12-03|Mitel Knowledge Corp|Echo cancellation|

JP4155774B2|2002-08-28|2008-09-24|富士通株式会社|Echo suppression system and method|

US7420937B2|2002-12-23|2008-09-02|Broadcom Corporation|Selectively adaptable far-end echo cancellation in a packet voice system|

EP1443498B1|2003-01-24|2008-03-19|Sony Ericsson Mobile Communications AB|Noise reduction and audio-visual speech activity detection|

JP4403776B2|2003-11-05|2010-01-27|沖電気工業株式会社|Echo canceller|

JP4533427B2|2005-02-21|2010-09-01|富士通株式会社|Echo canceller|

US7852950B2|2005-02-25|2010-12-14|Broadcom Corporation|Methods and apparatuses for canceling correlated noise in a multi-carrier communication system|

US7856098B1|2005-09-15|2010-12-21|Mindspeed Technologies, Inc.|Echo cancellation and control in discrete cosine transform domain|

CN1984102A|2005-12-13|2007-06-20|华为技术有限公司|Device and method for eliminating electric echo|

US7792281B1|2005-12-13|2010-09-07|Mindspeed Technologies, Inc.|Delay estimation and audio signal identification using perceptually matched spectral evolution|

US7852792B2|2006-09-19|2010-12-14|Alcatel-Lucent Usa Inc.|Packet based echo cancellation and suppression|

JP4916394B2|2007-07-03|2012-04-11|富士通株式会社|Echo suppression device, echo suppression method, and computer program|

JP2009207021A|2008-02-29|2009-09-10|Yamaha Corp|Acoustic echo canceler|

US8310937B2|2008-05-28|2012-11-13|Centurylink Intellectual Property Llc|Voice packet dynamic echo cancellation system|

US8600038B2|2008-09-04|2013-12-03|Qualcomm Incorporated|System and method for echo cancellation|

JP2010118793A|2008-11-11|2010-05-27|Oki Electric Ind Co Ltd|Propagation delay time estimator, program and method, and echo canceler|

JP5332733B2|2009-03-03|2013-11-06|沖電気工業株式会社|Echo canceller|

US8824666B2|2009-03-09|2014-09-02|Empire Technology Development Llc|Noise cancellation for phone conversation|

US20110013766A1|2009-07-15|2011-01-20|Dyba Roman A|Method and apparatus having echo cancellation and tone detection for a voice/tone composite signal|

RU2011103938A|2011-02-03|2012-08-10|ЭлЭсАй Корпорейшн |CONTROL OF ACOUSTIC ECHO SIGNALS BASED ON THE TIME AREA|

AU6764998A|1997-03-19|1998-10-12|Cultor Food Science, Inc.|Polymerization of mono-and disaccharides using low levels of mineral acids|

WO2009125492A1|2008-04-11|2009-10-15|三菱電機株式会社|Power divider|

CN103051818B|2012-12-20|2014-10-29|歌尔声学股份有限公司|Device and method for cancelling echoes in miniature hands-free voice communication system|

US9270830B2|2013-08-06|2016-02-23|Telefonaktiebolaget L M Ericsson |Echo canceller for VOIP networks|

US9420114B2|2013-08-06|2016-08-16|Telefonaktiebolaget Lm Ericsson |Echo canceller for VOIP networks|

US9100090B2|2013-12-20|2015-08-04|Csr Technology Inc.|Acoustic echo cancellation for a close-coupled speaker and microphone system|

US9819446B2|2014-10-29|2017-11-14|FreeWave Technologies, Inc.|Dynamic and flexible channel selection in a wireless communication system|

US10149263B2|2014-10-29|2018-12-04|FreeWave Technologies, Inc.|Techniques for transmitting/receiving portions of received signal to identify preamble portion and to determine signal-distorting characteristics|

US9787354B2|2014-10-29|2017-10-10|FreeWave Technologies, Inc.|Pre-distortion of receive signal for interference mitigation in broadband transceivers|

US10033511B2|2014-10-29|2018-07-24|FreeWave Technologies, Inc.|Synchronization of co-located radios in a dynamic time division duplex system for interference mitigation|

DE112015007019B4|2015-11-16|2019-07-25|Mitsubishi Electric Corporation|Echo sounding device and voice telecommunication device|

US10225395B2|2015-12-09|2019-03-05|Whatsapp Inc.|Techniques to dynamically engage echo cancellation|

DE102016119471A1|2016-10-12|2018-04-12|Deutsche Telekom Ag|Methods and devices for echo reduction and functional testing of echo cancellers|

CN109215672B|2017-07-05|2021-11-16|苏州谦问万答吧教育科技有限公司|Method, device and equipment for processing sound information|

JP6977772B2|2017-07-07|2021-12-08|ヤマハ株式会社|Speech processing methods, audio processors, headsets, and remote conversation systems|

CN108198551A|2018-01-15|2018-06-22|深圳前海黑鲸科技有限公司|The processing method and processing device of echo cancellor delay|

KR20200033617A|2018-09-20|2020-03-30|현대자동차주식회사|In-vehicle apparatus for recognizing voice and method of controlling the same|

US11122160B1|2020-07-08|2021-09-14|LenovoPte. Ltd.|Detecting and correcting audio echo|

Legal status:
2018-11-21| B06F| Objections, documents and/or translations needed after an examination request according [chapter 6.6 patent gazette]|

2020-06-09| B06U| Preliminary requirement: requests with searches performed by other patent offices: procedure suspended [chapter 6.21 patent gazette]|

2020-06-09| B15K| Others concerning applications: alteration of classification|Free format text: THE PREVIOUS CLASSIFICATIONS WERE: H04M 9/08 , H04M 1/24 Ipc: H04B 3/23 (2006.01), H04M 9/08 (2006.01), G10L 21/ |

2021-10-19| B350| Update of information on the portal [chapter 15.35 patent gazette]|

Priority:

Application number | Filing date | Patent title

US201261717156P| true| 2012-10-23|2012-10-23|

PCT/US2013/066144|WO2014066367A1|2012-10-23|2013-10-22|System and method for acoustic echo cancellation|
